I chose Prosper Loan dataset because it describes a new area (financial area) for me and therefore I will learn new things other that I will learn by dealing the red/white wine datasets (datasets seen before in term 1). Prosper marketplace is one of the first Peer-to-peer lending sites. It bring borrowers together on the same website with their loans. It is basically removing the middleman, which is the bank, and creates a mutually beneficial financial arrangement. Prosperdataset contains a large number of variables. The best way to understand them is by describing how a loan process works. Applying for a loan looks something like this:
My analysis is limited to the following variables:
Let’s get an overview on the dataset and its structure
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Prosper loan dataset consists of 113937 observations and 81 variables.
## [1] 113937 81
Q: The first question can be asked is what kinds of loans does Prosper make?
Listing Categories are:
## [1] "0 - Not Available" "1 - Debt Consolidation"
## [3] "2 - Home Improvement" "3 - Business"
## [5] "4 - Personal Loan" "5 - Student Use"
## [7] "6 - Auto, 7- Other" "8 - Baby&Adoption"
## [9] "9 - Boat" "10 - Cosmetic Procedure"
## [11] "11 - Engagement Ring" "12 - Green Loans"
## [13] "13 - Household Expenses" "14 - Large Purchases"
## [15] "15 - Medical/Dental" "16 - Motorcycle"
## [17] "17 - RV" "18 - Taxes"
## [19] "19 - Vacation" "20 - Wedding Loans"
As we see, the taller bar represents Debt Consolidation category. The most popular purposes for Prosper loans is to pay off credit card debt. “Not available” and “Other”" purposes came after.
Q: Term loans
As we can see, all loans have terms of one, three or five years.
As we can see, number of loans increases across years. We note that at 2009, the SEC ordered Prosper.com to stop making new loans, because federal law prohibits the sale or offer of securities without the company registering with the regulator. This is why we see a decrease in number of loans in this year. We see too that the number of loans in the first and 4th quater is greater than its number in the middle of year.
Q: loan Status
Completed loan status: is a loan that is fully paid off according to the schedule or prepaid. Current loan status: is a loan that is making payments on time and as agreed. ChargedOff loan status: is a loan with 5 missed payments (i.e. 150 days). Defaulted loan status: is a loan thah has a default reason, which is one of the followings:
Q: Prosper rating and Prosper score
## Warning: Removed 29084 rows containing non-finite values (stat_count).
We see listing Prosper ratings from the best(A) to the worst(HR). The taller bar represents listings without any Prosper Rating (NA). To discover/exploring his story, I facet the plot by ListingCreationYear variable. Prosper score is a custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score.
As we can see in the plot, the bar bresenting no Prosper Rating exists only in 2005,2006,2008 and 2009. After 2009, the N/A bar disappeared. What I think about is the Prosper.com adds Prosper Rating feature after 2009.
Q: Distribution of loan amounts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Amount loan is limited between 1000 and 35000 dollars. It has a right skewed distribution. We see some descending peaks at 4000, 15000, 20000 and 25000 dollars.
Q: What about borrower rate ?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
BorrowerRate is the Borrower’s interest rate for a given loan. As can we see, it is nearly normally distributed.
Q: How many days take a loan to be be funded?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 9.00 12.34 13.00 1095.00
## Warning: Removed 1828 rows containing non-finite values (stat_count).
## Warning: Removed 2 rows containing missing values (geom_bar).
As we can see, a loan takes by median 9 days to be funded. The hostogram plot is limted at 100 days. There are many outliers in the NumberOfDaysLoanBeFunded data.
Q: Borrower Employment Status
As we can see, most borrowers are employed.
Q: Borrowers Occupation
##
## Accountant/CPA
## 3588 3233
## Administrative Assistant Analyst
## 3688 3602
## Architect Attorney
## 213 1046
## Biologist Bus Driver
## 125 316
## Car Dealer Chemist
## 180 145
## Civil Service Clergy
## 1457 196
## Clerical Computer Programmer
## 3164 4478
## Construction Dentist
## 1790 68
## Doctor Engineer - Chemical
## 494 225
## Engineer - Electrical Engineer - Mechanical
## 1125 1406
## Executive Fireman
## 4311 422
## Flight Attendant Food Service
## 123 1123
## Food Service Management Homemaker
## 1239 120
## Investor Judge
## 214 22
## Laborer Landscaping
## 1595 236
## Medical Technician Military Enlisted
## 1117 1272
## Military Officer Nurse's Aide
## 346 491
## Nurse (LPN) Nurse (RN)
## 492 2489
## Other Pharmacist
## 28617 257
## Pilot - Private/Commercial Police Officer/Correction Officer
## 199 1578
## Postal Service Principal
## 627 312
## Professional Professor
## 13628 557
## Psychologist Realtor
## 145 543
## Religious Retail Management
## 124 2602
## Sales - Commission Sales - Retail
## 3446 2797
## Scientist Skilled Labor
## 372 2746
## Social Worker Student - College Freshman
## 741 41
## Student - College Graduate Student Student - College Junior
## 245 112
## Student - College Senior Student - College Sophomore
## 188 69
## Student - Community College Student - Technical School
## 28 16
## Teacher Teacher's Aide
## 3759 276
## Tradesman - Carpenter Tradesman - Electrician
## 120 477
## Tradesman - Mechanic Tradesman - Plumber
## 951 102
## Truck Driver Waiter/Waitress
## 1675 436
## [1] 68
Borrowers mentioned 68 occupations. Professional (13628) and “other”(28617) are the most borrower occupations.
Q: Borrower Employment status duration
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7625
## Warning: Removed 7709 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Meadian of Borrowers EmploymentStatusDuration is abour 67 months(5.5 years). Mean is at 8 years.
Q:
proportion of Borrower Homeowner is very close to that of not Borrower Homeowner.
Q: Let’s know about the Debt To Income Ratio of the borrower.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
Most borrowers have DebtToIncomeRatio in (0,1] interval.
Q: Investors are the persons that funded a loan. Each loan has its number of investors
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
As we can see, many investors can funded one loan.
Q:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this graph, I plot the distribution of EstimatedReturn and EstimatedLoss. The variable values are assigned to the listing (loan before be funded) at the time it was created. They have a normal distrubution.
There are 113937 loans (or listings, the listing become loan when is funded) with 81 variables. There are many factor variables. In this analysis, I intersted to these factor variables: ProsperScore, LoanStatus, ProsperRating..Alpha., LoanOriginationAmount Occupation, EmploymentStatus and IsBorrowerHomeowner.
The main features in the data set are ProsperRating..Alpha., ProsperScore, LoanStatus, DebtToIncomeRatio. I’d like to determine which features are best for predicting if a loan will pay-off its loan on time.
investigation into your feature(s) of interest?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?
I just ordered the Loan Status factor variable, to make the bar of “Due past >120 days” loan status coming after the “Due past(90-120 days)” bar in the plot.
Q: The introduction of this report stated that based on the ProsperScore and other information Prosper will obtain, a borrower will be assigned a ProsperRating (risk score), from A to HR. Now I w’d like to know how is the relationsip between these two variables: ProsperScore and ProsperRating using scatterplot.
As we can see, high ProsperScore leads to the best ProsperRating and vice versa. 1 ProsperScore is a the worst score lead to a worst ProperRating (E, HR). We see very few points in the ProperRating best region (A and AA). This is because ProsperRating is built depending on other information than the ProsperScore. We note that the plot suffers from some overplotting without jitter.
Q: What other information according to which the ProsperRating is calculated?
We see that despite the borrower is not homeowner can get A, AA and B prosper rating. Therefore the homeowner caracteristic is not used to determine prosper rating.
Q: Is DebtToIncomeRatio used to determine ProsperRating and BorrowerRate?
It is clear that it’s not used to determine Prosper Rating.
## Warning: Removed 8554 rows containing missing values (geom_point).
Nor to determine BorrowerRate
Q: Now, one can ask: Is the 5 years loan term available for all borrowers? In other word, is there a relationship between terms and Prosper Rating?
Loan is assigned by a Prosper rating at the time the listing was created. It takes the sympols: A(best) AA B C D E HR(worst). As we can see, term loan depend on the Prosper rating. Five-year term loan is available at all Prosper Rating levels, but only a three-year term is available on an HR Prosper rating loan.
Q: Let’s know if the ProsperRating affect the BorrowerRate.
loans with worst risk score (Prosper rating) has Higer BorrowerRate; a strong negative correlation. Ok, but why? Can be a way to recover the loan faster? We note that the plot suffers from some overplotting without jitter.
Q: Is there a relationship between Prosper rating and EstimatedLoss
## Warning: Removed 29084 rows containing missing values (geom_point).
A strong negative correlation between EstimatedLoss and ProsperRating. Best prosperRating borrowers have small estimated loss.
Q: Does the loan Term influence the BorrowerRate, let’s see.
What I thought that where the years Term increase the BorrowerRate increase too, like a traditional bank does. This is not what is shown in the box plot. It’s true that BorrowerRate median increase according to number of years in the Term, howeever three years Term has the larger box and range and has some ouliers. As we saw in univariate section, the number of 3 years term loans is the largest. what I think this term is the goal of this business.
Q: Let’s know about BorrowerRate and LoanOriginalAmount.
BorrowerRate range and density decreases as long as LoanAmount increases. I can’t do strong conclusion for these two variables with this plot as it is. I do think that other factor or factors affects the BorrrowerRate. I will explore that in the multivariate section. Otherwise, we can see some points far way to the global density; outliers with very small or 0 BorrowerRate. Let’s know who are :)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1415.5 287.0 530.0 599.5 964.1 1590.1 58848
I created a new variable NumberOfDaysPayOff to calculate number of days taken by the borrower to pay-off its loan. I used this variable to observe any anomaly in the loans with 0 BorrowerRate.
## LoanStatus ListingCategory..numeric. NumberOfDaysPayOff
## 1 Completed 0 32.68085
## 2 Defaulted 0 1243.91489
## 3 Completed 3 1118.29787
## 4 Completed 0 1119.31915
## 5 Completed 0 1119.31915
## 6 Completed 1 514.72340
## 7 Completed 0 214.46809
## 8 Completed 7 823.14894
## ProsperRating..Alpha. Term LoanOriginationYear LoanOriginalAmount
## 1 36 2006 1000
## 2 36 2007 1900
## 3 36 2008 1000
## 4 36 2007 3000
## 5 36 2006 5000
## 6 36 2008 25000
## 7 36 2006 1000
## 8 36 2008 3000
I tried to write a code to discover from where these outliers came. I selected some variables Which I suspect they caused these data. I can’t conclude ant thing! All 0 borrower rate loan are completed, i.e, the borrowers pay-off thiers loans.
Q: Let’s know about loan amount and Terms.
As was expected, when the loan amount increases the pay-off term also increases and vice varsa. This is true for one year term loan. For the two other terms, we can see that they have praticely the same LoanAmount range and a different mean and median (look at boxplot). What explain the same LoanAmount, a different term? No doubt a third variable.
Q: Does the Debt To Income Ratio influence the BorrowerRate?
There is no correlation between the two variables. It seems that the BorrowerRate is determined in isolation of the borrower financial situation.
Q:
In the above boxplot, we try to looking at how ProsperRating and LoanAmount relate with each other. Loan amount is the amount funded by investors. We note that it may be that the borrower is asking for an amount, but he will not have the full amount requested. As it is shown, the worst score risk loans E and H recive the lesser amounts, this is explained by the small range.
As we see in previous plot, number of loan with debt consolidation purpose is the larget among all other purpose. Likwise, the median of loan amount with this purpose is the biggest too. After come the ‘other’ purpose (8) then the Wedding Loans (20).
Q: Is the DebtToIncomeRatio relate to the Loan amount
## Warning: Removed 51377 rows containing missing values (geom_point).
They don’t relate to each other. I limited the x axis from 0 to 0.25 to explore better the data.
Q:
## Warning: Removed 29084 rows containing missing values (geom_point).
Estimated loss is the estimated principal loss on charge-offs. It is assigned to the listing at the time it was created ( before loan be funded). We can see that as long as the estimated loss increases the loan amount and density (number of loan) decrease.
Q: Does a lower risk score (ProsperRating) can lead more to a Chargedoff loan than the higher risk score?
Pink color in the bar chart represents the proportion of a chargedoff loan according to each score risk(Prosperrating). Whenever we move towards the worst ProsperRating, i.e, towards the higest risk score, the chargedoff proportion increases. This was expected. The scatterplot shows this clearly with borrowers density.
Q: In the plot below, I gathered all loan status that we can say that there is a problem with thiers borrowers in one group. This group does contain Completed, Cancelled, Current and FinalPaymentInProgress status. And I ploted the char bar of prosper rating by the status in the new group.
Despite D prosper rating loans are not the most numerous loan in the data (the C prosper rating loan is), we can observe that D prosper rating delinquent borrowers are more noumerous than any other borrowers even than to E and HR borrowers. This may be due to the fact that thiers prosper rating come in the middle between the good and worst prosper rating and this confuses investors.
Q:
This plot is when I add Term variable. Whtever the term, D prosper rating delinquent borrower exist satically. What I observed is the number of E prosper rating delinquent borrowers decrease when we pass to higer Term. The opposite occurs with A prosper rating delinquent borrowers.
Q:
As we can see, EmploymentStatus doesn’t influence the pay-off/not pay-off of the loan. Where we see that the Chargedoff and Defaulted loan can be the loan status of any borrower emplyement status.
investigation. How did the feature(s) of interest vary with other features in
the dataset?
A Strong positive relationship between: ProsperScore and ProsperRating; BorrowerRate and EstimatedLoss.
A positive relationship between: ProsperRating and LoanOriginalAmount; Term and LoanOriginalAmount.
A strong negative relationship between BorrowerRate and ProsperRating; EstimatedLoss and ProsperRating.
A negative relationship between EstimatedLoss and LoanOriginalAmount; BorrowerRate and LoanOriginalAmount.
Q: How BorrowerRate, LoanOriginalAmount and ProsperRating ralate to each other?
In this nice plot, we see that a [0, 0.25] BorrowerRate interval is reserved for borrower with the best ProsperRating (A , AA, B and C). Loan amount greater than 25000$ is allowed only to borrowers with A, AA, B prosper rating, where we see only 3 colors representing these prosper ratings.
Q:
In this plot, I tried to explore 3 variables as in previous plot, but this time I replaced LoanOriginationAmount by the Term. This plot says: HR prosper rating borrwers (brown color) cannot have 5 years term. 3 years term is more dense with largest BorrowerRate range. [0, 0.25] borrowerRate interval is reserved for borrowers with best prosper rating.
Q:
Another nice plot :). In addition to what we said on the previous plot, we add that borrowers with AA, A and B prosper rating are only the borrowers that theirs amount loans could exceed 25000$. This plot gathers all what we said ine the 2 previous plot.
Q:
Here, I just replaced BorrowerRate variable by EstimatedLoss. The two variables are strongly related. In this plot, the boundaries between categories and intervals are very clear.
Q: In the bivariate plots section, we saw that despite the E and RH prosper rating borrowers have higher borrowerRate and Estimatedloss than those in D prosper rating borrowrs, the latter are the most numerous. Why? I think the following plots answer us.
In the first graph I verify LoanOriginalAmount for the different ProsperRating. In the second graph, I ploted LOanOriginalAmount by BorrowerRate, and I colored the ProsperRating for only delinqued group. Green Color (D) is dense in this plot. As we can see, the loan amount makes the difference and therfore the MonthlyLoanPayment. Investors defund D perser rating loans almost like they defund B and C. A bad prosper rating with a challenging MonthlyLoanPayment lead to more delinqued borrowrs.
To create this model, first I supposed if a loan status is in this list: (Completed, Cancelled, Current, FinalPaymentInProgress) then its borrower is delinquent. I create IsBorrowerDelinquent independent variable.
Second, explanatory variables must not be strongly correlated each other. Therefore I chose EstimatedLoss rather BorrowerRate because the latter determine better the boundaries of prosper rating levels. The other explanatory variable is the LoanOriginalAmount.
## [1] 21213.25
##
## Call:
## glm(formula = IsBorrowerDelinquent ~ ., family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8353 -0.4839 -0.3649 -0.2976 2.7290
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.225e+00 8.592e-02 -37.540 < 2e-16 ***
## EstimatedLoss 1.292e+01 5.494e-01 23.515 < 2e-16 ***
## LoanOriginalAmount -2.411e-05 5.161e-06 -4.672 2.98e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13596 on 21212 degrees of freedom
## Residual deviance: 12629 on 21210 degrees of freedom
## AIC: 12635
##
## Number of Fisher Scoring iterations: 5
## [1] "EstimatedLoss coef 40.8399034023793"
## [1] "LoanOriginalAmount coef 0.999975890290644"
## [1] "LoanOriginalAmount coef 1.00002411029065"
Interpreting the results of our logistic regression model:
First of alla, we can see that EstimatedLoss and LoanOriginalAmount are statistically significant; thiers p-values are < 0.05. The smaller p-value the more statistically significant. Negatice sign indicates anegative correlation between predictor and independent variables. For one unit increase in Estimated Loss, delinquent borrowers are 40 times as likely, holding all other variables constant. For one unit deccrease in LoanOriginalAmount, delinquent borrowers are 1 time as likely, holding all other variables constant.
## [1] "Accuracy 0.900188560653677"
The 0.90 accuracy on the test set is quite a good result. When I added DebtToIncome variable I get 0.9060. I can’t consider it as a sognificant improvement.
yes there are. BorrowerRate, EstimatedLoss and ProsperRating are strongly correlated variables to each other. LoanOriginalAmount and ProsperRating, and LoanOriginalAmount and Term are two by tow positively corelated.
Yes, the number of D prosperRating delinquent borrowers are more numerous to the one od E and HR delinquent borrowers.
Yes, I created a logistic linear model starting from the IsDeliquentBorrower, EstimatedLoss and LoanOriginalAmount variables. I tries to add other variables to improve accuracy value but in vain.
I chose this plot to describe and sums up this busness. BorrowerRate doesn’t vary according to the loan term category. Exactly the opposite happens in traditional bank, where interst rate increases as long as the pay-off term increases. The magority of Borrowers are employed with a debt to income ratio in an interval of ( 0,1]. Borrowers ask more for a loan in 1st and 4th quarter of year.
These two plots sums up all what can be said about LoanOriginalAmout, Term, ProsperRating and BorrowerRate. In one side, they show a strong correlation between BorrowerRate, EstimatedLoss and ProsperRating to each other, and a positive correlation between LoanOriginalAmout and Tem/prosperRating in other side.
In this plot, I tried to find why D prosper rating delinquet borrowers are more numerous compared to E and HR prosper rating delinquet borrowers, despite E and HR prosper rating delinquet borrowers have higer BorroweRate and EstimatedLoss. I concluded that D prosper rating is a confused prosper rating for investors (lenders). It come between the a zone of good prosper ratings and the worst. Lenders fund better this prosper rating level compared to E and HR prosper rating levels and the highly relatively monthly loan payement become a challenge for borrowers with this level of risk score.
Peer to peer lending! This is the first time I hear about this kind of system. Along this journey, I learned new things and practiced what I saw in the previous lessons. At first, the large number of prosper data variables made me confused. But as I went step by step, I started to get used to it. At first I just tried to use 10 to 15 variables, but at the end of this data analysis I found myself looking beyond the 81 variables to build my model :) I started by understanding the provided description of variables. Then I asked myself some questions and I tried to find answers by exploring data with statistical summaries and graphics. There was a clear trend between ProsperRating and BorrowerRate/EstimatedLoss in one side and between delinquent borrowers and EstimatedLoss/BorrowerRate/LoanOriginalAmount in other side. For the logistic linear model, I included LoanOriginalAmount and EstimatedLoss to predict delinquency in borrowers. The model is able to predict delinquent borrowers with an accuracy of 0.90. It is a good value but is not enough. To improve this model, I would to know more about borrower behaviour by asking more question at the time he create his listing.